Analysis of how chemical properties influence the quality of red wines by Meixian Chen
Introduction: This report uses R to quantitatively analyze how chemical properties influence the quality of red wines. The tidy wine data consist of 1599 red wines with 11 variables describing the chemical properties of each wine. Each wine was rated by at least 3 wine experts on a scale from 0 (very bad) to 10 (excellent).
Overview of the wine dataset
Here is the structure of the dataset, which contains 1599 observations of 11 chemical properties and a quality rating for each wine.
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
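The listing above has the layout that `str()` prints for a data frame. As a minimal sketch, the first ten observations can be rebuilt by hand from the values shown. The object name `red_head` and the file name in the comment are assumptions for illustration, and `density` is left out because the listing shows only five of its values.

```r
# First ten observations, copied from the str() listing above.
# density is omitted: the listing truncates it after five values.
red_head <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity     = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid          = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  residual.sugar       = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  chlorides            = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.065, 0.073, 0.071),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  pH                   = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality              = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# The full data would be read from a file instead, for example:
# red <- read.csv("wineQualityReds.csv")  # hypothetical file name

# Print the structure: one line per variable with its type and first values
str(red_head)
```

With the full file loaded, the same `str()` call would report 1599 observations instead of 10.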
Summary statistics of each variable
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Visual distribution of each variable
The distribution of wine quality is close to a normal distribution. Most wines are rated as average (5 or 6); there are a few bad wines (rated less than 5) and a few good wines (rated more than 6).
We continue to inspect the distribution of each of the 11 variables.
The distribution of fixed.acidity is close to a normal distribution. Fixed acidity values above 12.5 are considered outliers.
I remove the outlier point with a volatile acidity close to 1.6 from the data frame in the further analysis.
Unlike the above two variables, the distribution of citric.acid peaks at 0, and the counts decrease as the citric.acid value increases. I remove the outlier whose value is 1.00.
Most of the residual.sugar levels lie in the interval [0, 4]. This distribution has a long, thin tail.
The distribution of chlorides is very similar to that of residual.sugar, and both have a long tail of outliers. Do the outliers in the two plots come from the same observations?
The next two boxplots show the chlorides variable on the subset of the data where the residual.sugar level is less than 3, and the residual.sugar variable on the subset where the chlorides level is less than 1.2.
If the outliers came from the same observations, we would expect the new plots to differ from the original ones. Since there is no obvious change in the new plots, the outliers come from different observations. Moreover, the correlation cor(red$residual.sugar, red$chlorides) ≈ 0.054 is rather low. Similar distributions of two variables do not imply a strong correlation between them.
## [1] 0.05371106
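The check above (do the two sets of outliers share observations, and how correlated are the variables?) can also be done numerically. The following is a minimal sketch on the first ten observations shown in the structure listing, not on the full data, so its numbers will differ from the value printed above. The names `red_head`, `outlier_rows` and `shared` are hypothetical, and the 1.5 × IQR whisker rule of `boxplot.stats()` is assumed as the outlier definition.

```r
# First ten residual.sugar and chlorides values from the str() listing
red_head <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  chlorides      = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.065, 0.073, 0.071)
)

# Row indices of the points a boxplot would draw beyond its whiskers
# (assumption: the default 1.5 * IQR rule of boxplot.stats)
outlier_rows <- function(x) which(x %in% boxplot.stats(x)$out)

sugar_out    <- outlier_rows(red_head$residual.sugar)
chloride_out <- outlier_rows(red_head$chlorides)

# Observations that are outliers in both variables
shared <- intersect(sugar_out, chloride_out)

# Pearson correlation between the two variables
r <- cor(red_head$residual.sugar, red_head$chlorides)

print(shared)
print(r)
```

An empty `shared` vector means that no observation is an outlier in both variables, which is the same conclusion the filtered boxplots are used to reach.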
I remove outliers with total.sulfur.dioxide values >200 in the further analysis.
The density of the red wines lies in a small range between 0.99 g/ml and 1.01 g/ml (pure water is 1.00 g/ml). This matches what I expected.
Most of the wines have a pH between 2.9 and 4 (the full range in the summary is 2.74 to 4.01).
I remove outliers with sulphates values >1.5.
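The outlier removals described in this section can be collected into a single filtering step. This is a sketch under stated assumptions, not the original code: the data frame `toy` is illustrative (four rows from the structure listing plus four synthetic rows that each carry one maximum from the summary table), the cut-off 1.5 for volatile.acidity is an assumed threshold for "close to 1.6", and the >200 cut is applied to total.sulfur.dioxide because it is the only variable in the summary that exceeds 200.

```r
# Illustrative data: rows 1-4 come from the str() listing; rows 5-8 are
# synthetic and each place one summary maximum into an otherwise ordinary row.
toy <- data.frame(
  volatile.acidity     = c(0.7, 0.88, 0.76, 0.28, 1.58, 0.7, 0.7, 0.7),
  citric.acid          = c(0, 0, 0.04, 0.56, 0, 1.00, 0, 0),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 34, 289, 34),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.56, 2.00)
)

# Keep only the rows that pass every outlier rule
cleaned <- subset(
  toy,
  volatile.acidity < 1.5 &        # assumed cut-off for the ~1.6 point
    citric.acid < 1 &             # drops the value 1.00
    total.sulfur.dioxide <= 200 & # drops values > 200
    sulphates <= 1.5              # drops values > 1.5
)

print(nrow(toy) - nrow(cleaned))  # number of rows removed
```

Combining the rules in one `subset()` call keeps the cleaning step in a single place, so a threshold can be changed without touching the plotting code.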
The alcohol level of the red wines is mostly between 8 and 14 degrees (% by volume).
The distributions of most variables are close to a normal distribution: the majority of observations fall in the middle bins and fewer fall in the low or high bins.
We are mainly interested in which chemical properties affect the wine quality, and also in how strongly they relate to each other. Here is a matrix presenting the correlation values among the variables. Red dots show negative correlations and blue dots show positive correlations. The darkness and the size of a dot show how strong the correlation is.
From the correlation matrix, quality is mainly related to alcohol and volatile.acidity levels, then to citric.acid and sulphates levels.
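The numbers behind such a correlation matrix come from `cor()` applied to the whole data frame. A minimal sketch follows, using only the first ten observations from the structure listing and a subset of the columns, so the resulting values are not those of the full dataset. The names `red_head`, `cor_mat` and `ranked` are hypothetical.

```r
# First ten observations of the variables discussed above
red_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations between all columns
cor_mat <- cor(red_head)

# Correlations of each chemical property with quality,
# ordered from strongest to weakest by absolute value
with_quality <- cor_mat["quality", setdiff(colnames(cor_mat), "quality")]
ranked <- with_quality[order(abs(with_quality), decreasing = TRUE)]

print(round(cor_mat, 2))
print(round(ranked, 2))
```

Sorting by absolute value puts strong negative correlations next to strong positive ones, which is the ordering used when choosing variables for the bivariate plots.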
We plot the quality variable against the chemical properties that are most strongly related to it.
Plotting alcohol against quality shows that better-quality red wines tend to have higher alcohol levels. For average and good wines (rated more than 4), the mean alcohol level increases with the quality rating.
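The mean alcohol level per quality rating can be computed directly with `tapply()`. The sketch below uses only the first ten observations from the structure listing, which contain just the ratings 5, 6 and 7, so it illustrates the computation rather than the trend in the full data. The names `red_head` and `mean_alcohol` are hypothetical.

```r
# First ten alcohol and quality values from the str() listing
red_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Mean alcohol level within each quality rating
mean_alcohol <- tapply(red_head$alcohol, red_head$quality, mean)

print(mean_alcohol)
```

The result is a named vector with one entry per rating, which can be drawn on top of a boxplot or scatter plot of alcohol against quality.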
Good wine tends to have low volatile acidity.
The third chemical property we plot is citric.acid. It is the third or fourth most strongly related variable to quality, and it is also strongly related to volatile.acidity. We first provide a bivariate plot of citric.acid and quality; in the next section, we further investigate the multivariable relationship among quality, citric.acid and volatile.acidity.
Good wine tends to have higher citric acid.
The first multivariable plot is a scatter plot that uses the two properties most strongly related to quality, alcohol and volatile acidity, as axes, and quality as color to highlight the distribution of the different categories of wines. The majority of good wines lie in the zone of higher alcohol and lower volatile acidity, while the bad wines show the opposite pattern.
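Coloring points by category requires turning the numeric rating into a factor. A minimal sketch with `cut()` follows, assuming the grouping used earlier in the report (bad: below 5, average: 5 or 6, good: above 6). The function name `classify_quality` and the labels are hypothetical.

```r
# Map a numeric quality rating to one of three ordered categories.
# Intervals are right-closed: (-Inf, 4], (4, 6], (6, Inf)
classify_quality <- function(quality) {
  cut(quality,
      breaks = c(-Inf, 4, 6, Inf),
      labels = c("bad", "average", "good"))
}

# First ten quality ratings from the str() listing
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
quality_class <- classify_quality(quality)

# Number of wines per category
print(table(quality_class))
```

The resulting factor can then be used as the color variable of a scatter plot, for example via `col = as.integer(quality_class)` in base `plot()`.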
The second multivariable plot uses citric acid and volatile acidity as axes, and quality as color to highlight the distribution of the different categories of wines. The majority of good wines lie in the zone of higher citric acid and lower volatile acidity, while the bad wines show the opposite pattern. More interestingly, wines with a higher citric acid level also tend to have a lower volatile acidity level.
The last plot we present is a 3D figure. To obtain a clear image, we select only the very good wines (rated 8, the maximum) and the very bad wines (rated 4 or lower) from the dataset. The interesting finding is that good wines usually have lower volatile acidity, higher citric acid and a higher alcohol level, while bad wines show the opposite.
To summarize this report, we select three representative plots: a one-variable plot showing the distribution of wine quality, a two-variable plot showing how a chemical property influences the wine quality, and a multi-variable plot showing how chemical properties influence the wine quality and how they relate to each other.
The first plot gives an overview of the wine quality distribution. The distribution is close to normal, as we expect: the majority of wines on the market are average, and very good and very bad wines are few.
The second plot shows how alcohol, one of the most strongly influencing chemical properties, affects the wine quality. Good wines tend to have a high alcohol level in general.
The third plot shows how citric acid and volatile acidity influence the wine quality. Moreover, these two variables are also related to each other. The majority of good wines lie in the zone of higher citric acid and lower volatile acidity, while the bad wines show the opposite pattern. More interestingly, wines with a higher citric acid level also tend to have a lower volatile acidity level.
The most difficult part when starting the analysis was deciding which variable to start from. I wasted some time at the beginning trying to plot insightful figures. Generating a correlation matrix among the variables, or using the ggpairs function (GGally package) to analyze a subset of the dataset before the bivariate and multivariable plotting, is very useful. The analysis went more smoothly once the right combination of variables was picked. In addition, I struggled to generate clear figures that show the patterns in the data, especially in the multivariable plots. After adopting the suggestion to draw a linear regression line for each category of wine quality, the patterns became easier to discover.
Wine tasting is a personal thing. Some people prefer certain kinds of wine while others have different opinions. One direction for future work on this dataset would be to store the ratings from the different experts separately and then identify the common chemical properties of good wine for each individual. We could then recommend good wines based on similar personal preferences, rather than on the average opinion of a few experts.